Background: The CRoss Industry Standard Process for Data Mining (CRISP-DM) is applied to study the admission and discharge of patients in the emergency department at Hero DMC Heart Institute. Two (2) types of data analytics will be executed for this project: Descriptive and Predictive Analytics.
There are several pain points experienced in the Hospital Emergency Department, as follows:
Assumptions and limitations: Generally, prioritisation of patient admission is based on the triage scale, which is used in the healthcare community to categorise patients by the severity of their injuries, i.e. category 1 (immediate), category 2 (urgent) and category 3 (non-urgent). However, the triage level used to attend to patients during an emergency is excluded from the scope of this project, as the patient records in the dataset only categorise admissions into "outpatient" and "emergency".
Business Goal: To aid the emergency department's admission and discharge strategy by providing admission prediction
Data Mining Goal:
Expected Outcome
Project Timeline:
The project milestones are divided into 5 sprint deliverables, with the data products expected to be deployed in Week 13 as Release 1.0
There are a total of 15,757 rows and 56 columns in the dataset, which comprise both categorical and numerical data types.
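The categorical/numerical split can be verified with pandas' `select_dtypes`; a minimal sketch on a toy frame mirroring a few of this dataset's columns:

```python
import pandas as pd

# Toy frame mixing categorical (object) and numerical columns,
# mirroring the mix found in the admission dataset
df = pd.DataFrame({
    'GENDER': ['M', 'F', 'M'],                        # categorical
    'AGE': [81, 65, 53],                              # numerical
    'OUTCOME': ['DISCHARGE', 'DISCHARGE', 'EXPIRY'],  # categorical
})

n_categorical = df.select_dtypes(include='object').shape[1]
n_numerical = df.select_dtypes(include='number').shape[1]
print(n_categorical, n_numerical)  # → 2 1
```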
# Run this the first time only, it will download the project file from Github
!git clone https://github.com/samueltan3972/WQD7003-Data-Analytics.git
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('./WQD7003-Data-Analytics/HDHI_Admission_data/HDHI_Admission_data.csv')
data.head()
| SNO | MRD No. | D.O.A | D.O.D | AGE | GENDER | RURAL | TYPE OF ADMISSION-EMERGENCY/OPD | month year | DURATION OF STAY | ... | CONGENITAL | UTI | NEURO CARDIOGENIC SYNCOPE | ORTHOSTATIC | INFECTIVE ENDOCARDITIS | DVT | CARDIOGENIC SHOCK | SHOCK | PULMONARY EMBOLISM | CHEST INFECTION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 234735 | 4/1/2017 | 4/3/2017 | 81 | M | R | E | Apr-17 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 234696 | 4/1/2017 | 4/5/2017 | 65 | M | R | E | Apr-17 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 234882 | 4/1/2017 | 4/3/2017 | 53 | M | U | E | Apr-17 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 234635 | 4/1/2017 | 4/8/2017 | 67 | F | U | E | Apr-17 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 234486 | 4/1/2017 | 4/23/2017 | 60 | F | U | E | Apr-17 | 23 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 56 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15757 entries, 0 to 15756
Data columns (total 56 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   SNO                              15757 non-null  int64
 1   MRD No.                          15757 non-null  object
 2   D.O.A                            15757 non-null  object
 3   D.O.D                            15757 non-null  object
 4   AGE                              15757 non-null  int64
 5   GENDER                           15757 non-null  object
 6   RURAL                            15757 non-null  object
 7   TYPE OF ADMISSION-EMERGENCY/OPD  15757 non-null  object
 8   month year                       15757 non-null  object
 9   DURATION OF STAY                 15757 non-null  int64
 10  duration of intensive unit stay  15757 non-null  int64
 11  OUTCOME                          15757 non-null  object
 12  SMOKING                          15757 non-null  int64
 13  ALCOHOL                          15757 non-null  int64
 14  DM                               15757 non-null  int64
 15  HTN                              15757 non-null  int64
 16  CAD                              15757 non-null  int64
 17  PRIOR CMP                        15757 non-null  int64
 18  CKD                              15757 non-null  int64
 19  HB                               15505 non-null  object
 20  TLC                              15471 non-null  object
 21  PLATELETS                        15472 non-null  object
 22  GLUCOSE                          14894 non-null  object
 23  UREA                             15516 non-null  object
 24  CREATININE                       15510 non-null  object
 25  BNP                              7316 non-null   object
 26  RAISED CARDIAC ENZYMES           15757 non-null  int64
 27  EF                               14252 non-null  object
 28  SEVERE ANAEMIA                   15757 non-null  int64
 29  ANAEMIA                          15757 non-null  int64
 30  STABLE ANGINA                    15757 non-null  int64
 31  ACS                              15757 non-null  int64
 32  STEMI                            15757 non-null  int64
 33  ATYPICAL CHEST PAIN              15757 non-null  int64
 34  HEART FAILURE                    15757 non-null  int64
 35  HFREF                            15757 non-null  int64
 36  HFNEF                            15757 non-null  int64
 37  VALVULAR                         15757 non-null  int64
 38  CHB                              15757 non-null  int64
 39  SSS                              15757 non-null  int64
 40  AKI                              15757 non-null  int64
 41  CVA INFRACT                      15757 non-null  int64
 42  CVA BLEED                        15757 non-null  int64
 43  AF                               15757 non-null  int64
 44  VT                               15757 non-null  int64
 45  PSVT                             15757 non-null  int64
 46  CONGENITAL                       15757 non-null  int64
 47  UTI                              15757 non-null  int64
 48  NEURO CARDIOGENIC SYNCOPE        15757 non-null  int64
 49  ORTHOSTATIC                      15757 non-null  int64
 50  INFECTIVE ENDOCARDITIS           15757 non-null  int64
 51  DVT                              15757 non-null  int64
 52  CARDIOGENIC SHOCK                15757 non-null  int64
 53  SHOCK                            15757 non-null  int64
 54  PULMONARY EMBOLISM               15757 non-null  int64
 55  CHEST INFECTION                  15757 non-null  object
dtypes: int64(39), object(17)
memory usage: 6.7+ MB
data.shape
data.tail()
| SNO | MRD No. | D.O.A | D.O.D | AGE | GENDER | RURAL | TYPE OF ADMISSION-EMERGENCY/OPD | month year | DURATION OF STAY | ... | CONGENITAL | UTI | NEURO CARDIOGENIC SYNCOPE | ORTHOSTATIC | INFECTIVE ENDOCARDITIS | DVT | CARDIOGENIC SHOCK | SHOCK | PULMONARY EMBOLISM | CHEST INFECTION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15752 | 15753 | 699585 | 31/03/2019 | 04/04/2019 | 86 | F | U | O | Mar-19 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15753 | 15754 | 699500 | 3/31/2019 | 4/1/2019 | 50 | M | R | E | Mar-19 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15754 | 15755 | 700415 | 31/03/2019 | 09/04/2019 | 82 | M | U | E | Mar-19 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15755 | 15756 | 699524 | 31/03/2019 | 03/04/2019 | 59 | F | U | O | Mar-19 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15756 | 15757 | 699524 | 31/03/2019 | 03/04/2019 | 59 | F | U | O | Mar-19 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 56 columns
# data['SMOKING'].value_counts()
# Permanently changes the pandas settings
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.width', None)
# pd.set_option('display.max_colwidth', None)
data.loc[:, 'CHEST INFECTION'].value_counts()
0    15415
1      341
\        1
Name: CHEST INFECTION, dtype: int64
data.GLUCOSE.describe()
count     14894
unique      521
top         110
freq        270
Name: GLUCOSE, dtype: object
Automated exploratory data analysis
This step is performed to get a quick, high-level overview of the data
# !pip install sweetviz
# # Run below if facing issue with matplotlib with sweetviz
# !python -m pip uninstall matplotlib
# !pip install matplotlib==3.1.3
import sweetviz as sv
analyze_report = sv.analyze(data)
analyze_report.show_notebook()
# !pip install autoviz
# from autoviz.AutoViz_Class import AutoViz_Class
# AV = AutoViz_Class()
# df = AV.AutoViz('./WQD7003-Data-Analytics/HDHI_Admission_data/HDHI_Admission_data.csv', verbose=0)
# # df = AV.AutoViz(data, verbose=1)
Based on the Data Understanding results above, data cleaning and preprocessing are performed to resolve issues such as irrelevant information, errors and missing values, as well as to transform the data so that it is suitable for further analysis.
Steps for data cleaning executed are as below:
# Get Emergency only
data = data[data["TYPE OF ADMISSION-EMERGENCY/OPD"] == 'E']
# Fix the naming format problem
data.rename(columns = {'SMOKING ':'SMOKING'}, inplace = True)
# Drop unnecessary columns
data.drop(['SNO', 'MRD No.', 'D.O.A', 'D.O.D', 'TYPE OF ADMISSION-EMERGENCY/OPD', 'month year', 'duration of intensive unit stay'], axis = 1, inplace = True)
# Check for empty values in the dataset. The raw data marks some missing
# values with the literal string 'EMPTY' in addition to true NaN values.
def print_empty_value(data, isna_only=False):
    rows = []
    for col in data.columns:
        number_of_empty_value = data[col].isna().sum()
        if not isna_only:
            number_of_empty_value += (data[col] == 'EMPTY').sum()
        rows.append({'Column_Names': col, 'Num_Of_Empty_Value': number_of_empty_value})
    df = pd.DataFrame(rows)
    result = df[df['Num_Of_Empty_Value'] != 0]
    if len(result) == 0:
        print("No empty value")
    else:
        print(result)
    return result
missing_value = print_empty_value(data)
   Column_Names  Num_Of_Empty_Value
12           HB                 234
13          TLC                 261
14    PLATELETS                 262
15      GLUCOSE                 763
16         UREA                 214
17   CREATININE                 217
18          BNP                5541
20           EF                 754
# Replace 'EMPTY' placeholder values with NaN
data.replace('EMPTY', np.nan, inplace=True)
missing_value = print_empty_value(data, True)
   Column_Names  Num_Of_Empty_Value
12           HB                 234
13          TLC                 261
14    PLATELETS                 262
15      GLUCOSE                 763
16         UREA                 214
17   CREATININE                 217
18          BNP                5541
20           EF                 754
# Check the value counts of every column with missing values to verify that all remaining values are numeric
for i in range(len(missing_value)):
value_counts = data[missing_value.iloc[i, 0]].value_counts()
print([row for row in value_counts.index if not row.replace('.', '', 1).isdigit()])
[] [] [] [] [] [] [] []
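Since the check above confirms these columns contain only numeric strings, they can also be converted to a numeric dtype explicitly rather than relying on implicit coercion later. A minimal sketch using `pd.to_numeric`, where `errors='coerce'` maps any stray non-numeric entry to NaN:

```python
import pandas as pd

# Toy column mimicking the lab-result columns, which are stored as strings
s = pd.Series(['13.2', '110', None, 'EMPTY'])

# errors='coerce' converts numeric strings and turns anything else into NaN
converted = pd.to_numeric(s, errors='coerce')
print(converted.tolist())  # → [13.2, 110.0, nan, nan]
```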
# Run only once: if this cell is re-run after label encoding has already been applied to the dataset, the inverse encoding may fail
from sklearn.preprocessing import LabelEncoder
gender_label_encoder = LabelEncoder()
rural_label_encoder = LabelEncoder()
outcome_label_encoder = LabelEncoder()
gender_label_encoder.fit(data['GENDER'])
rural_label_encoder.fit(data['RURAL'])
outcome_label_encoder.fit(data['OUTCOME'])
print(list(gender_label_encoder.classes_))
print(list(rural_label_encoder.classes_))
print(list(outcome_label_encoder.classes_))
['F', 'M']
['R', 'U']
['DAMA', 'DISCHARGE', 'EXPIRY']
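LabelEncoder assigns integer codes in sorted class order, so the classes printed above map to codes 0, 1, 2 respectively. A quick sketch of this behaviour:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Classes are sorted alphabetically: DAMA → 0, DISCHARGE → 1, EXPIRY → 2
codes = enc.fit_transform(['DISCHARGE', 'DAMA', 'EXPIRY', 'DISCHARGE'])
print(list(enc.classes_))  # → ['DAMA', 'DISCHARGE', 'EXPIRY']
print(list(codes))         # → [1, 0, 2, 1]
```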
# Label encoding in categorical variable, GENDER, RURAL, OUTCOME
data['GENDER'] = gender_label_encoder.transform(data['GENDER'])
data['RURAL'] = rural_label_encoder.transform(data['RURAL'])
data['OUTCOME'] = outcome_label_encoder.transform(data['OUTCOME'])
# Impute data with KNN Imputer
from sklearn.impute import KNNImputer
imputer = KNNImputer()
After_imputation = imputer.fit_transform(data)
imputed_data = pd.DataFrame(After_imputation, index=data.index, columns=data.columns)
print_empty_value(imputed_data)
No empty value
| Column_Names | Num_Of_Empty_Value |
|---|---|
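KNN imputation fills each missing entry with the mean of that feature over the k nearest rows, measured by a NaN-aware Euclidean distance. A minimal sketch of the behaviour on toy data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Three rows; the last row is missing its first feature
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0]])

# With n_neighbors=2, both complete rows are neighbours of the last row,
# so the missing value becomes the mean of their first feature: (1 + 3) / 2
imputer = KNNImputer(n_neighbors=2)
out = imputer.fit_transform(X)
print(out[2, 0])  # → 2.0
```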
# Code example to inverse the encoding
# data['GENDER'] = gender_label_encoder.inverse_transform(data['GENDER'])
# data['RURAL'] = rural_label_encoder.inverse_transform(data['RURAL'])
# data['OUTCOME'] = outcome_label_encoder.inverse_transform(data['OUTCOME'])
data
| AGE | GENDER | RURAL | DURATION OF STAY | OUTCOME | SMOKING | ALCOHOL | DM | HTN | CAD | ... | CONGENITAL | UTI | NEURO CARDIOGENIC SYNCOPE | ORTHOSTATIC | INFECTIVE ENDOCARDITIS | DVT | CARDIOGENIC SHOCK | SHOCK | PULMONARY EMBOLISM | CHEST INFECTION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 81 | 1 | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 65 | 1 | 0 | 5 | 1 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 53 | 1 | 1 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 67 | 0 | 1 | 8 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 60 | 0 | 1 | 23 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15748 | 74 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15749 | 52 | 0 | 1 | 5 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15751 | 60 | 0 | 1 | 9 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15753 | 50 | 1 | 0 | 2 | 2 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15754 | 82 | 1 | 1 | 10 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10924 rows × 49 columns
imputed_data.head()
| AGE | GENDER | RURAL | DURATION OF STAY | OUTCOME | SMOKING | ALCOHOL | DM | HTN | CAD | ... | CONGENITAL | UTI | NEURO CARDIOGENIC SYNCOPE | ORTHOSTATIC | INFECTIVE ENDOCARDITIS | DVT | CARDIOGENIC SHOCK | SHOCK | PULMONARY EMBOLISM | CHEST INFECTION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 81.0 | 1.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 65.0 | 1.0 | 0.0 | 5.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 53.0 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 67.0 | 0.0 | 1.0 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 60.0 | 0.0 | 1.0 | 23.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 49 columns
# Normalization goes here -- further study is needed to determine the best normalization technique
# For now, MinMaxScaler of sklearn is used; it scales each column independently to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

scale_cols = ['GLUCOSE', 'UREA', 'PLATELETS', 'TLC', 'HB', 'EF', 'CREATININE', 'BNP']
scaler = MinMaxScaler()
imputed_data[scale_cols] = scaler.fit_transform(imputed_data[scale_cols])
imputed_data[["GLUCOSE", "UREA", "PLATELETS", "TLC", "HB", "EF", "CREATININE", "BNP"]]
| GLUCOSE | UREA | PLATELETS | TLC | HB | EF | CREATININE | BNP | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.097549 | 0.068499 | 0.285484 | 0.050972 | 0.342105 | 0.456522 | 0.054098 | 0.375500 |
| 1 | 0.137163 | 0.036169 | 0.125948 | 0.028353 | 0.563158 | 0.608696 | 0.054098 | 0.037390 |
| 2 | 0.230007 | 0.187715 | 0.278695 | 0.046512 | 0.400000 | 0.847826 | 0.144801 | 0.041233 |
| 3 | 0.159445 | 0.054354 | 0.242206 | 0.031220 | 0.515789 | 0.608696 | 0.034661 | 0.076301 |
| 4 | 0.176776 | 0.110932 | 0.021571 | 0.028672 | 0.557895 | 0.043478 | 0.076774 | 0.367494 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15748 | 0.160683 | 0.036573 | 0.316033 | 0.025677 | 0.508421 | 1.000000 | 0.033366 | 0.014412 |
| 15749 | 0.116118 | 0.054354 | 0.160741 | 0.023256 | 0.489474 | 1.000000 | 0.054098 | 0.057126 |
| 15751 | 0.273335 | 0.036169 | 0.052969 | 0.079325 | 0.268421 | 0.521739 | 0.028183 | 0.233387 |
| 15753 | 0.305521 | 0.189735 | 0.120008 | 0.049379 | 0.536842 | 0.304348 | 0.112407 | 0.040432 |
| 15754 | 0.258480 | 0.135179 | 0.315185 | 0.036954 | 0.331579 | 0.391304 | 0.118886 | 0.223379 |
10924 rows × 8 columns
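Min-max scaling maps each column independently to [0, 1] via (x - min) / (max - min); a quick sketch verifying this on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [40.0]])
scaled = MinMaxScaler().fit_transform(X)

# Each value becomes (x - 10) / (40 - 10)
print(scaled.ravel())  # → [0.         0.33333333 1.        ]
```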
import matplotlib.pyplot as plt
import numpy as np
# Divide the ages into bins and group the data by bin
data['age_bin'] = pd.cut(data['AGE'], bins=[0, 30, 60, 90, 120], labels=['<30', '30-60', '60-90', '>90'])
age_groups = data.groupby('age_bin')['DURATION OF STAY'].max()
# Set the x-axis to the age bin labels and the y-axis to the maximum duration of stay
x = ['<30', '30-60', '60-90', '>90']
y = age_groups
# Create a bar chart
plt.bar(x, y, color = 'blue')
# Add a title and labels for the x and y axes
plt.title('Maximum Duration of Stay During Emergency by Age Group')
plt.xlabel('Age Group (years)')
plt.ylabel('Maximum Duration of Stay (in days)')
# Show the plot
plt.show()
import matplotlib.pyplot as plt
import numpy as np
# Group the data by age bin and compute the mean duration of stay
age_groups_mean = data.groupby('age_bin')['DURATION OF STAY'].mean()
# Set the x-axis to the age bin labels and the y-axis to the average duration of stay
x = ['<30', '30-60', '60-90', '>90']
y = age_groups_mean
# Create a bar chart
plt.bar(x, y, color = 'blue')
# Add a title and labels for the x and y axes
plt.title('Average Duration of Stay During Emergency by Age Group')
plt.xlabel('Age Group (years)')
plt.ylabel(' Average Duration of Stay (in days)')
# Show the plot
plt.show()
df = pd.read_csv('./WQD7003-Data-Analytics/HDHI_Admission_data/HDHI_Admission_data.csv')
df.head()
| SNO | MRD No. | D.O.A | D.O.D | AGE | GENDER | RURAL | TYPE OF ADMISSION-EMERGENCY/OPD | month year | DURATION OF STAY | ... | CONGENITAL | UTI | NEURO CARDIOGENIC SYNCOPE | ORTHOSTATIC | INFECTIVE ENDOCARDITIS | DVT | CARDIOGENIC SHOCK | SHOCK | PULMONARY EMBOLISM | CHEST INFECTION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 234735 | 4/1/2017 | 4/3/2017 | 81 | M | R | E | Apr-17 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 234696 | 4/1/2017 | 4/5/2017 | 65 | M | R | E | Apr-17 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 234882 | 4/1/2017 | 4/3/2017 | 53 | M | U | E | Apr-17 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 234635 | 4/1/2017 | 4/8/2017 | 67 | F | U | E | Apr-17 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 234486 | 4/1/2017 | 4/23/2017 | 60 | F | U | E | Apr-17 | 23 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 56 columns
# Get Emergency only (copy to avoid SettingWithCopyWarning when adding columns later)
df1 = df[df["TYPE OF ADMISSION-EMERGENCY/OPD"] == 'E'].copy()
df1.head()
| SNO | MRD No. | D.O.A | D.O.D | AGE | GENDER | RURAL | TYPE OF ADMISSION-EMERGENCY/OPD | month year | DURATION OF STAY | ... | CONGENITAL | UTI | NEURO CARDIOGENIC SYNCOPE | ORTHOSTATIC | INFECTIVE ENDOCARDITIS | DVT | CARDIOGENIC SHOCK | SHOCK | PULMONARY EMBOLISM | CHEST INFECTION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 234735 | 4/1/2017 | 4/3/2017 | 81 | M | R | E | Apr-17 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 234696 | 4/1/2017 | 4/5/2017 | 65 | M | R | E | Apr-17 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 234882 | 4/1/2017 | 4/3/2017 | 53 | M | U | E | Apr-17 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 234635 | 4/1/2017 | 4/8/2017 | 67 | F | U | E | Apr-17 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 234486 | 4/1/2017 | 4/23/2017 | 60 | F | U | E | Apr-17 | 23 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 56 columns
# Plot graph to observe the trending of total duration of patient stays during emergency
import matplotlib.pyplot as plt
import pandas as pd
# Parse the 'month year' column into datetime objects
df1['month year'] = pd.to_datetime(df1['month year'], format='%b-%y')
# Set the x-axis to the month/year labels and the y-axis to the average duration of stay
x = df1['month year']
y = df1['DURATION OF STAY']
# Create a line plot
plt.plot(x, y)
# Add a title and labels for the x and y axes
plt.title('Overall Trending Duration of Stay During Emergency')
plt.xlabel('Month-Year')
plt.ylabel('Duration of Stay (in days)')
# Show the plot
plt.show()
# Plot graph to observe the trending of average duration of patient stays during emergency
import matplotlib.pyplot as plt
import pandas as pd
# Parse the 'month year' column into datetime objects
df1['month year'] = pd.to_datetime(df1['month year'], format='%b-%y')
# Group the data by the 'month year' column and calculate the mean duration of stay for each month
groups = df1.groupby('month year')['DURATION OF STAY'].mean()
# Set the x-axis to the month/year labels and the y-axis to the average duration of stay
x = groups.index
y = groups.values
# Create a line plot
plt.plot(x, y)
# Add a title and labels for the x and y axes
plt.title('Trending of Average Duration of Stay During Emergency')
plt.xlabel('Month-Year')
plt.ylabel('Average Duration of Stay (in days)')
# Show the plot
plt.show()
data.columns
Index(['AGE', 'GENDER', 'RURAL', 'DURATION OF STAY', 'OUTCOME', 'SMOKING',
'ALCOHOL', 'DM', 'HTN', 'CAD', 'PRIOR CMP', 'CKD', 'HB', 'TLC',
'PLATELETS', 'GLUCOSE', 'UREA', 'CREATININE', 'BNP',
'RAISED CARDIAC ENZYMES', 'EF', 'SEVERE ANAEMIA', 'ANAEMIA',
'STABLE ANGINA', 'ACS', 'STEMI', 'ATYPICAL CHEST PAIN', 'HEART FAILURE',
'HFREF', 'HFNEF', 'VALVULAR', 'CHB', 'SSS', 'AKI', 'CVA INFRACT',
'CVA BLEED', 'AF', 'VT', 'PSVT', 'CONGENITAL', 'UTI',
'NEURO CARDIOGENIC SYNCOPE', 'ORTHOSTATIC', 'INFECTIVE ENDOCARDITIS',
'DVT', 'CARDIOGENIC SHOCK', 'SHOCK', 'PULMONARY EMBOLISM',
'CHEST INFECTION', 'age_bin'],
dtype='object')
col_viz = ['SMOKING','ALCOHOL', 'DM', 'HTN', 'CAD', 'PRIOR CMP', 'CKD',
'RAISED CARDIAC ENZYMES', 'SEVERE ANAEMIA', 'ANAEMIA',
'STABLE ANGINA', 'ACS', 'STEMI', 'ATYPICAL CHEST PAIN', 'HEART FAILURE',
'HFREF', 'HFNEF', 'VALVULAR', 'CHB', 'SSS', 'AKI', 'CVA INFRACT',
'CVA BLEED', 'AF', 'VT', 'PSVT', 'CONGENITAL', 'UTI',
'NEURO CARDIOGENIC SYNCOPE', 'ORTHOSTATIC', 'INFECTIVE ENDOCARDITIS',
'DVT', 'CARDIOGENIC SHOCK', 'SHOCK', 'PULMONARY EMBOLISM',
'CHEST INFECTION']
import seaborn as sns
import matplotlib.pyplot as plt
for col in col_viz:
plt.figure()
sns.barplot(x=data[col], y=data['DURATION OF STAY'])
# Add a title to the plot
plt.title(f"DURATION OF STAY vs {col}")
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
# Compute pairwise correlations over the numeric columns (null values are excluded pairwise)
corr = data.corr(numeric_only=True)
# Keep only correlations of at least 0.3, commonly interpreted as a 'moderate' association
corr = corr[(corr >= 0.3)]
# Plot a heatmap of the correlations
sns.set(rc = {'figure.figsize':(10,8)})
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='magma')
plt.title('Correlation of Features')
plt.xlabel('Features')
plt.ylabel('Features')
# Show the plot
plt.show()
This section applies PCA to reduce the number of attributes. The first part determines the principal component index to use; the second part applies PCA to the dataset.
# Principal Component Analysis (PCA)
# Determine the Principal Component Index
pca_imputed_data = imputed_data
# preparing for PCA
from sklearn.decomposition import PCA
pre_pca_x = imputed_data.drop(['DURATION OF STAY'], axis=1)
pre_pca_y = imputed_data['DURATION OF STAY']
pca = PCA()
# Determine transformed features
pca.fit_transform(pre_pca_x)
# Determine explained variance using the explained_variance_ratio_ attribute
exp_var_pca = pca.explained_variance_ratio_
# Cumulative sum of eigenvalues; This will be used to create step plot
# for visualizing the variance explained by each principal component.
cum_sum_eigenvalues = np.cumsum(exp_var_pca)
trim_exp_var_pca = exp_var_pca[0:15]
trim_cum_sum_eigenvalues = cum_sum_eigenvalues[0:15]
print(trim_cum_sum_eigenvalues)
# Create the visualization plot
plt.bar(range(0,len(trim_exp_var_pca)), trim_exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(trim_cum_sum_eigenvalues)), trim_cum_sum_eigenvalues, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.grid(axis='y',color='grey',alpha=0.5)
plt.show()
[0.97984764 0.98218674 0.98431587 0.98606282 0.98744295 0.98873924 0.9898385 0.9908805 0.99183973 0.99274607 0.99357734 0.99432848 0.99502605 0.99566388 0.99605557]
# run PCA to execute feature reduction
pca = PCA(n_components=8)
pca_x = pca.fit_transform(pre_pca_x)
display(len(pca_x))
display(len(pca_x[0]))
print(pca.explained_variance_ratio_)
10924
8
[0.97984764 0.00233909 0.00212913 0.00174696 0.00138013 0.00129628 0.00109926 0.001042 ]
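Rather than fixing `n_components` manually, scikit-learn's PCA also accepts a float in (0, 1) and keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.randn(200, 10)

# Keep just enough principal components to explain at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# The retained components cover at least the requested variance fraction
print(pca.explained_variance_ratio_.sum() >= 0.95)  # → True
```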
# from sklearn.model_selection import train_test_split
# # split the dataset, remove random_state if want completely random
# X_train, X_test, y_train, y_test = train_test_split(
# imputed_data.drop(['DURATION OF STAY'], axis=1), imputed_data['DURATION OF STAY'], test_size=0.2, random_state=0)
# print('X train shape:', X_train.shape)
# print('X test shape:', X_test.shape)
# print('y train shape:', y_train.shape)
# print('y test shape:', y_test.shape)
The dataset is split 80%/20% into a training and a testing set respectively for evaluating model performance. Upon splitting, the training and testing sets contain 8739 and 2185 records respectively.
The following models are evaluated, with metrics including R-squared, mean absolute error, mean squared error and root mean squared error:
(1) Linear Regression
(2) Decision Tree Regression
(3) Random Forest Regression
(4) Support Vector Regression
(5) Bayesian Ridge Regression
(6) Gradient Boosting Regression
(7) Elastic Net Regression
(8) Light Gradient Boosting Machine Regression
(9) Extreme Gradient Boosting Regression
(10) K-Nearest Neighbors Regression
A cross-validation technique will then be applied to the best model to obtain optimal hyperparameters / configurations.
# Import library for training - testing set split
from sklearn.model_selection import train_test_split
# Split dataset into 80-20
def split_data(data):
#test train split
X = data.drop(['DURATION OF STAY'], axis=1)
y = data['DURATION OF STAY'].values # Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42) # 80-20% split with random state (seed) of 42
print("-" * 80, "\nTraining set size: ", len(y_train), "\nTesting set size: ", len(y_test), "\n", "-" * 80)
return X_train, X_test, y_train, y_test
print("\nTraining vs Testing set:")
X_train_initial ,X_test_initial ,y_train_initial ,y_test_initial = split_data(imputed_data)
Training vs Testing set: -------------------------------------------------------------------------------- Training set size: 8739 Testing set size: 2185 --------------------------------------------------------------------------------
# Import libraries for modelling
import sklearn
from sklearn.linear_model import LinearRegression # Multiple Linear Regression, order of x = 1
from sklearn.tree import DecisionTreeRegressor # Decision Tree Regression
from sklearn.ensemble import RandomForestRegressor # Random Forest Regression
from sklearn.svm import SVR # Support Vector Regression
from sklearn.linear_model import BayesianRidge # Bayesian Ridge Regression
from sklearn.ensemble import GradientBoostingRegressor # Gradient Boosting Regression
from sklearn.linear_model import ElasticNet # Elastic Net Regression
from lightgbm import LGBMRegressor # Light Gradient Boosting Machine Regression
from xgboost.sklearn import XGBRegressor # Extreme Gradient Boosting Regression
from sklearn.neighbors import KNeighborsRegressor # K-Nearest Neighbors Regression
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV # GridSearch to determine best hyperparameters for kNN
from sklearn import metrics # Metrics for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures # Polynomial Linear Regression, non-linear terms
import pickle # For deployment
# Before proceeding with fitting the models, we determine the optimal k for kNN Regression
params = {'n_neighbors': list(range(1, 49))}
knn = neighbors.KNeighborsRegressor(n_jobs = -1)
model = GridSearchCV(knn, params, cv = 10)
model.fit(X_train_initial, y_train_initial)
model.best_params_
{'n_neighbors': 25}
# Models used:
# (1) Linear Regression
# (2) Decision Tree Regression
# (3) Random Forest Regression
# (4) Support Vector Regression
# (5) Bayesian Ridge Regression
# (6) Gradient Boosting Regression
# (7) Elastic Net Regression
# (8) Light Gradient Boosting Machine Regression
# (9) Extreme Gradient Boosting Regression
# (10) k-Nearest neighbors Regression
LinearR = LinearRegression(n_jobs=-1).fit(X_train_initial, y_train_initial) # (1)
DecisionT = DecisionTreeRegressor(random_state = 42).fit(X_train_initial, y_train_initial) # (2)
RandomForestR = RandomForestRegressor(random_state = 42, n_jobs=-1).fit(X_train_initial, y_train_initial) # (3)
SupportVectorR = SVR().fit(X_train_initial, y_train_initial) # (4)
BayesianR = BayesianRidge().fit(X_train_initial, y_train_initial) # (5)
GradientB = GradientBoostingRegressor(random_state = 42).fit(X_train_initial, y_train_initial) # (6)
ElasticN = ElasticNet(random_state = 42).fit(X_train_initial, y_train_initial) # (7)
LightGBM = LGBMRegressor(random_state = 42, n_jobs=-1).fit(X_train_initial, y_train_initial) # (8)
XGBR = XGBRegressor(random_state = 42, n_jobs=-1).fit(X_train_initial, y_train_initial) # (9)
kNearestN = KNeighborsRegressor(n_neighbors = model.best_params_['n_neighbors'], n_jobs=-1).fit(X_train_initial, y_train_initial) # (10), k taken from the grid search above
models = [LinearR, DecisionT, RandomForestR, SupportVectorR, BayesianR, GradientB, ElasticN, LightGBM, XGBR, kNearestN]
def Regressions(models, x_train, x_test, y_train, y_test):
model_performance = pd.DataFrame()
for model in models:
regression_model = model
# Prediction using test set
y_pred = regression_model.predict(x_test)
# R-squared (Coefficient of determination)
# Proportion of the variation in the dependent variable that is predictable from the independent variable
r_squared = metrics.r2_score(y_test, y_pred)
# Mean Absolute Error (MAE)
# Average of the absolute values of individual prediction errors over all instances in the test set
mae = metrics.mean_absolute_error(y_test, y_pred)
# Mean Squared Error (MSE)
# Average squared difference between the predicted values and the actual value
mse = metrics.mean_squared_error(y_test, y_pred)
# Root Mean Squared Error (RMSE)
# Standard deviation of the residuals (prediction errors)
rmse = mse ** 0.5
# Storing results in dataframe
results = pd.DataFrame({
'Model': [str(type(model).__name__)],
'R-squared': [round(r_squared, 4)],
'Mean Absolute Error (MAE)': [round(mae, 4)],
'Mean Squared Error (MSE)': [round(mse, 4)],
'Root Mean Squared Error (RMSE)': [round(rmse, 4)]
})
model_performance = pd.concat([model_performance, results], ignore_index = True)
return model_performance
# Display results, sorted by coefficient of determination in descending order
model_performance = Regressions(models, X_train_initial, X_test_initial, y_train_initial, y_test_initial)
model_performance.sort_values('R-squared', ascending = False)
| | Model | R-squared | Mean Absolute Error (MAE) | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) |
|---|---|---|---|---|---|
| 2 | RandomForestRegressor | 0.2216 | 2.8450 | 19.3453 | 4.3983 |
| 7 | LGBMRegressor | 0.2070 | 2.9644 | 19.7086 | 4.4394 |
| 8 | XGBRegressor | 0.1543 | 3.0247 | 21.0174 | 4.5845 |
| 5 | GradientBoostingRegressor | 0.1190 | 3.0650 | 21.8945 | 4.6792 |
| 4 | BayesianRidge | 0.0773 | 3.2063 | 22.9332 | 4.7889 |
| 0 | LinearRegression | 0.0763 | 3.2064 | 22.9573 | 4.7914 |
| 9 | KNeighborsRegressor | 0.0410 | 3.2447 | 23.8334 | 4.8819 |
| 6 | ElasticNet | 0.0059 | 3.4045 | 24.7072 | 4.9706 |
| 3 | SVR | -0.0243 | 3.2493 | 25.4576 | 5.0456 |
| 1 | DecisionTreeRegressor | -0.5801 | 3.8011 | 39.2706 | 6.2666 |
From the results above, Random Forest is the best performing model for the present problem, with the highest R-squared value (coefficient of determination) and the lowest root mean squared error (RMSE).
However, the Random Forest Regressor explains only about 22.16% of the variation in patients' duration of stay from the independent variables in the dataset.
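As a refresher on how these R-squared figures are computed, R² = 1 − SS_res/SS_tot, i.e. one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch verifying this against `sklearn.metrics.r2_score`, using illustrative toy values rather than the project dataset:

```python
import numpy as np
from sklearn import metrics

# Illustrative toy values, not taken from the project dataset
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.3, 3.0, 6.5, 4.0])

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Matches the library implementation
assert np.isclose(r2_manual, metrics.r2_score(y_true, y_pred))
```

An R² of 0.2216 therefore means the model's predictions reduce the squared error by about 22% relative to always predicting the mean duration of stay.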
To improve model performance, we perform a grid search with cross-validation on the Random Forest model to obtain the best hyperparameters.
By doing so, the score for every combination of hyperparameters is computed and we can select the combination with the best performance.
# Perform GridSearch with cross validation on Random Forest model to obtain best hyperparameters, then re-evaluate
# As the grid search takes a lot of compute power and time, the number of parameters is reduced to avoid running for too long
params = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 4, None],
'max_leaf_nodes': [10, 20, None]
}
search = GridSearchCV(RandomForestRegressor(n_jobs = -1), params, cv = 10)
RF_Search_Result = search.fit(X_train_initial, y_train_initial)
print("Best configurations of Random Forest Model:\n", RF_Search_Result.best_params_)
print("\nAverage cross-validated score of the best_estimator: ", RF_Search_Result.best_score_)
Best configurations of Random Forest Model:
{'max_depth': None, 'max_leaf_nodes': None, 'n_estimators': 300}
Average cross-validated score of the best_estimator: 0.2171994421765792
# "Best" model evaluation on the testing dataset
RF_Best_Model = RF_Search_Result.best_estimator_
y_test_pred = RF_Best_Model.predict(X_test_initial)
r_squared_RF = metrics.r2_score(y_test_initial, y_test_pred)
mae_RF = metrics.mean_absolute_error(y_test_initial, y_test_pred)
mse_RF = metrics.mean_squared_error(y_test_initial, y_test_pred)
rmse_RF = mse_RF ** 0.5
print('R-squared\t\t\t: ', round(r_squared_RF, 4))
print('Mean Absolute Error (MAE)\t: ', round(mae_RF, 4))
print('Mean Squared Error (MSE)\t: ', round(mse_RF, 4))
print('Root Mean Squared Error (RMSE)\t: ', round(rmse_RF, 4))
R-squared : 0.2263
Mean Absolute Error (MAE) : 2.8394
Mean Squared Error (MSE) : 19.2288
Root Mean Squared Error (RMSE) : 4.3851
# Save the trained model with best performance to disk
filename = 'RFmodel.pkl'
pickle.dump(RF_Best_Model, open(filename, 'wb'))
# Load the model from disk, for future deployment / predictions
loaded_model = pickle.load(open(filename, 'rb'))
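The loaded model behaves identically to the original, which is what makes this round-trip suitable for deployment. A self-contained sketch of the same pickle save/load pattern on a small synthetic dataset (the file name and data here are illustrative only):

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression dataset, for illustration only
rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(0, 0.1, 100)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Save the trained model to disk, then load it back (hypothetical file name)
with open('demo_model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('demo_model.pkl', 'rb') as f:
    restored = pickle.load(f)

# The restored model reproduces the original predictions exactly
assert np.array_equal(model.predict(X), restored.predict(X))
```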
As presented in the model performance table, the highest R-squared value of all models trained was 0.2216. The low performance of these models might indicate that there is no relationship between the features (x) and patients' duration of stay (y) in the data, or that the relationship between x and y is non-linear.
To check for a non-linear relationship between x and y, a polynomial regression model is trained on the data and its metric scores are obtained. This is done by identifying the optimal degree of the independent variables (x) with an elbow plot, then training the model with the best degree found.
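A minimal sketch of this degree search, assuming a `PolynomialFeatures` + `LinearRegression` pipeline and synthetic data standing in for the project's train/test split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship, in place of the project split
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(0, 0.2, 200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Train one polynomial model per candidate degree and record the test R-squared;
# the "elbow" is the degree after which the score stops improving
scores = {}
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = r2_score(y_test, model.predict(X_test))

best_degree = max(scores, key=scores.get)
```

On the actual data, the elbow plot of these scores against degree is what guides the choice of the final polynomial model.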
The output of the notebook is executed with papermill and then deployed to GitHub Pages for web hosting. Below is the link to the deployed page:
!pip install papermill
Requirement already satisfied: papermill in c:\users\lalala\anaconda3\lib\site-packages (2.4.0)
# The result of the ipynb is rendered and saved with papermill
import papermill as pm
pm.execute_notebook(
'WQD7003_DATA_ANALYTICS_Group_5.ipynb',
'WQD7003_output.ipynb'
)
As part of the project completion, the deliverables below are produced for project close-out:
The team also conducted an internal retrospective session to evaluate the overall experience of delivering the project, capturing the key success stories, the main challenges, and improvements that can be made in the future:
1. What went well
2. What are the main challenges
3. What can be improved